
fix: reduce memory overflow from checkpoint reads and writes (#344) #571

Open
svarlamov wants to merge 5 commits into main from devin/1771643297-memory-overflow-fixes

Conversation

@svarlamov
Member

svarlamov commented Feb 21, 2026

fix: reduce memory overflow from checkpoint reads and writes (#344)

Summary

Addresses the runaway memory usage (30-60GB) reported in #344 by fixing the highest-impact patterns in checkpoint I/O:

1. Streaming prune-and-rewrite on append (repo_storage.rs): append_checkpoint previously read ALL checkpoints into memory, appended one, pruned, then wrote ALL back — O(N) memory per append. Now it streams existing checkpoints one-at-a-time through a BufReader, prunes char-level attributions for files superseded by the new checkpoint, writes to a temp file via BufWriter, appends the new checkpoint, then atomically renames. Peak memory is one checkpoint + the new checkpoint, rather than all checkpoints. The file stays small throughout long agent loops because pruning happens on every append. (A sketch of this streaming pattern, covering items 3 and 4 as well, follows this list.)

2. Eliminate redundant full reads (checkpoint.rs): A single checkpoint::run() call previously triggered 4+ independent read_all_checkpoints() deserializations of the entire JSONL file. Now checkpoints are read once at the top of run() and passed through to get_all_tracked_files via a new preloaded_checkpoints parameter.

3. Streaming reads (repo_storage.rs): read_all_checkpoints now uses BufReader line-by-line instead of fs::read_to_string, avoiding holding the full file string and parsed structs in memory simultaneously.

4. BufWriter for writes (repo_storage.rs): write_all_checkpoints now streams serialization through BufWriter instead of building a full string in memory. An explicit flush() call ensures write errors are propagated rather than silently dropped on BufWriter::drop.
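
Below is a minimal sketch of the streaming append pattern described in items 1, 3, and 4. It assumes a serde-serializable Checkpoint type and a placeholder prune_superseded helper; the real struct fields, prune logic, and error handling in repo_storage.rs differ.

```rust
use std::fs::{self, File};
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::path::Path;

// Hypothetical checkpoint shape; the real struct in repo_storage.rs differs.
#[derive(serde::Serialize, serde::Deserialize)]
struct Checkpoint {
    files: Vec<String>,
    // ... kind, char-level attributions, timestamps, etc.
}

/// Streaming read-modify-write append: at most one existing checkpoint is
/// held in memory at a time, and the final rename swaps the file atomically.
fn append_checkpoint(dir: &Path, new_cp: &Checkpoint) -> std::io::Result<()> {
    let path = dir.join("checkpoints.jsonl");
    let tmp = dir.join("checkpoints.jsonl.tmp");
    let mut writer = BufWriter::new(File::create(&tmp)?);

    if path.exists() {
        // Stream existing checkpoints one JSONL line at a time.
        for line in BufReader::new(File::open(&path)?).lines() {
            let line = line?;
            if line.trim().is_empty() {
                continue;
            }
            let mut cp: Checkpoint = serde_json::from_str(&line)?;
            // Placeholder prune: the real code clears char-level attributions
            // for files superseded by the new checkpoint.
            prune_superseded(&mut cp, new_cp);
            serde_json::to_writer(&mut writer, &cp)?;
            writer.write_all(b"\n")?;
        }
    }

    // Append the new checkpoint, flush so write errors surface instead of
    // being dropped in BufWriter::drop, then atomically replace the file.
    serde_json::to_writer(&mut writer, new_cp)?;
    writer.write_all(b"\n")?;
    writer.flush()?;
    fs::rename(&tmp, &path)?;
    Ok(())
}

// Hypothetical stand-in for the inlined prune logic.
fn prune_superseded(old: &mut Checkpoint, new_cp: &Checkpoint) {
    old.files.retain(|f| !new_cp.files.contains(f));
}
```

The key property is that only one existing checkpoint plus the new one is ever deserialized at a time, and the rename keeps the on-disk file consistent even if the process dies mid-write (apart from the leftover .tmp noted in the checklist below).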

All 38 checkpoint-related unit tests pass (31 checkpoint tests + 7 repo_storage tests). No new dependencies added.

Updates since last revision

Major rework: The initial approach deferred char-level attribution pruning to write_all_checkpoints, but this left un-pruned data in the file during the intra-commit loop where memory issues are worst. The new approach prunes on every append_checkpoint call using a streaming read-modify-write pattern that keeps only one checkpoint in memory at a time. The prune_old_char_attributions method was removed entirely (logic inlined into append_checkpoint).

Other changes:

  • write_all_checkpoints signature reverted to &[Checkpoint] (no longer needs &mut since pruning moved to append)
  • Added explicit writer.flush()?; in both append_checkpoint and write_all_checkpoints to ensure I/O errors are propagated

Review & Testing Checklist for Human

  • Streaming prune correctness: The new append_checkpoint assumes the checkpoint being appended is the newest for its files (clears attributions from older entries with matching files). This should always be true since we append chronologically, but verify no code path appends out-of-order or writes un-pruned checkpoints via write_all_checkpoints directly.
  • has_no_ai_edits logic equivalence: The early-exit check in checkpoint::run() was rewritten from all_ai_touched_files().is_empty() to checkpoints.iter().all(|cp| cp.entries.is_empty() || cp.kind != AiAgent/AiTab). These should be logically equivalent but the double-negative is easy to get wrong — worth a careful trace through both code paths. A small equivalence sketch follows this checklist.
  • Real-world validation: Test with a repo that has a large checkpoint file (>100MB) and multiple agent sessions. Verify memory usage stays reasonable during git commit and that attributions are correctly preserved end-to-end. Unit tests validate correctness but not the memory improvement.
  • Temp file cleanup: append_checkpoint writes to checkpoints.jsonl.tmp then renames. If the process crashes mid-write, the temp file is left behind (harmless but clutters .git/ai/). Consider if cleanup is needed.
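
To make the equivalence question in the second checklist item concrete, here is a small sketch with hypothetical Checkpoint and CheckpointKind types (the real definitions in the codebase differ). Both checks are vacuously true for an empty list, which is exactly what makes the reset reordering flagged in the review below dangerous.

```rust
// Hypothetical types; the real enum and struct live elsewhere in the codebase.
enum CheckpointKind {
    Human,
    AiAgent,
    AiTab,
}

struct Checkpoint {
    kind: CheckpointKind,
    entries: Vec<String>, // files touched by this checkpoint
}

fn is_ai(kind: &CheckpointKind) -> bool {
    matches!(kind, CheckpointKind::AiAgent | CheckpointKind::AiTab)
}

/// Old check: collect all AI-touched files, then test emptiness.
fn has_no_ai_edits_old(checkpoints: &[Checkpoint]) -> bool {
    checkpoints
        .iter()
        .filter(|cp| is_ai(&cp.kind))
        .flat_map(|cp| cp.entries.iter())
        .next()
        .is_none()
}

/// New check: every checkpoint is either empty or not an AI checkpoint.
/// Note that `.all()` on an empty iterator returns true.
fn has_no_ai_edits_new(checkpoints: &[Checkpoint]) -> bool {
    checkpoints
        .iter()
        .all(|cp| cp.entries.is_empty() || !is_ai(&cp.kind))
}
```

A checkpoint contributes AI-touched files exactly when it is an AI kind and has entries; "no such checkpoint exists" and "every checkpoint is empty or non-AI" are negations of the same condition, so the two functions agree on every input.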

Notes

  • get_all_tracked_files gained an optional preloaded_checkpoints parameter. Existing callers that don't pass it will still work (reads from disk as before). A sketch of this pattern follows these notes.
  • No changes to checkpoint format or serialization — purely I/O optimization.
  • The streaming prune approach means write_all_checkpoints no longer prunes. If any code path writes un-pruned data via write_all_checkpoints, it won't be pruned until the next append_checkpoint.
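
A hedged sketch of the optional preloaded-checkpoints pattern from the first note, with made-up signatures; the real get_all_tracked_files in the codebase takes different arguments and return types.

```rust
use std::collections::HashSet;

struct Checkpoint {
    entries: Vec<String>,
}

struct WorkingLog;

impl WorkingLog {
    fn read_all_checkpoints(&self) -> std::io::Result<Vec<Checkpoint>> {
        // Real implementation streams the JSONL file with a BufReader.
        Ok(Vec::new())
    }

    /// Callers that already hold the checkpoints pass Some(&checkpoints) and
    /// avoid a second deserialization of the JSONL file; callers that pass
    /// None keep the old read-from-disk behavior.
    fn get_all_tracked_files(
        &self,
        preloaded_checkpoints: Option<&[Checkpoint]>,
    ) -> std::io::Result<HashSet<String>> {
        let owned;
        let checkpoints: &[Checkpoint] = match preloaded_checkpoints {
            Some(cps) => cps,
            None => {
                owned = self.read_all_checkpoints()?;
                &owned
            }
        };
        Ok(checkpoints
            .iter()
            .flat_map(|cp| cp.entries.iter().cloned())
            .collect())
    }
}
```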

Link to Devin run: https://app.devin.ai/sessions/2a46b6eaa71f4f46913488bef2ff52a1
Requested by: @svarlamov



- Make append_checkpoint truly append-only (O(1) instead of O(N) read-write-all)
- Use BufReader for streaming JSONL reads instead of read_to_string
- Eliminate 3 redundant read_all_checkpoints() calls in checkpoint::run()
- Pass pre-loaded checkpoints to get_all_tracked_files
- Defer char-level attribution pruning to write_all_checkpoints
- Use BufWriter for efficient checkpoint serialization

Addresses #344

Co-Authored-By: Sasha Varlamov <sasha@sashavarlamov.com>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@git-ai-cloud-dev

No AI authorship found for these commits. Please install git-ai to start tracking AI generated code in your commits.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

devin-ai-integration[bot]

This comment was marked as resolved.

Co-Authored-By: Sasha Varlamov <sasha@sashavarlamov.com>
@svarlamov
Member Author

devin review devin review's feedback

devin-ai-integration bot and others added 3 commits February 21, 2026 05:10
…ent data loss

Co-Authored-By: Sasha Varlamov <sasha@sashavarlamov.com>
…during agent loops

Co-Authored-By: Sasha Varlamov <sasha@sashavarlamov.com>
…_checkpoint)

Co-Authored-By: Sasha Varlamov <sasha@sashavarlamov.com>
Contributor

devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 8 additional findings in Devin Review.


Comment on lines +164 to +169
let mut checkpoints = if reset {
working_log.reset_working_log()?;
Vec::new()
} else {
working_log.read_all_checkpoints()?
};

🔴 Reordering reset before early-exit causes data loss when reset=true and is_pre_commit=true

When reset=true and is_pre_commit=true, the new code destroys the working log (checkpoints, blobs, INITIAL file) at line 165 before the early-exit check at lines 177-195. After the reset, checkpoints is empty, so has_no_ai_edits is always true (.all() on empty iterator returns true). Additionally, reset_working_log() deletes the INITIAL file (src/git/repo_storage.rs:213-215), so has_initial_attributions is false. Unless inter_commit_move is enabled, the early exit fires and the function returns Ok((0, 0, 0)) — the working log is destroyed but no fresh checkpoint is created.

Root cause: operation reordering

In the old code, the early-exit check at lines 161-178 (LEFT) ran BEFORE the reset at lines 281-287 (LEFT). If the old data had AI checkpoints, all_ai_touched_files() would return non-empty, has_no_ai_edits would be false, and the early exit would NOT fire. The reset would then happen, and the function would proceed to create a fresh checkpoint.

In the new code, the reset happens first at line 164-166 (RIGHT), clearing all checkpoints, blobs, and the INITIAL file. Then the early-exit check at lines 177-195 sees empty checkpoints and an empty INITIAL file, so it almost always fires.

Old behavior with reset=true, is_pre_commit=true, old data had AI edits:

  • Early exit does NOT fire (old data has AI files)
  • Reset happens
  • Fresh checkpoint is created ✓

New behavior:

  • Reset happens (old data destroyed)
  • Early exit fires (empty checkpoints → has_no_ai_edits=true)
  • Returns (0, 0, 0) — no new checkpoint, data lost ✗

Impact: If reset=true is ever combined with is_pre_commit=true, AI attribution data is permanently destroyed without replacement.

Prompt for agents
In src/commands/checkpoint.rs, the block that reads/resets checkpoints (lines 164-174) was moved before the is_pre_commit early-exit check (lines 176-196). This changes behavior when reset=true and is_pre_commit=true: the working log is destroyed before the early-exit fires, causing data loss. To fix this, move the reset/read block back after the early-exit check, or restructure so that when reset=true the early-exit is skipped. The simplest fix: move lines 164-174 (the read_checkpoints_start block including the reset) to after the is_pre_commit early exit block (after line 196), and for the early-exit check use a separate lightweight read (e.g. working_log.all_ai_touched_files() as before, or read checkpoints once for the early-exit check and then conditionally reset afterward).
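
A hedged sketch of the second fix suggested in the prompt above: read the checkpoints once, run the early-exit check against what was actually on disk, and only then apply the destructive reset. All types, fields, and parameter names here (working_log, is_pre_commit, inter_commit_move, has_initial_attributions, is_ai) are stand-ins for whatever checkpoint::run() really uses.

```rust
// Hypothetical, condensed skeleton of checkpoint::run() showing only the ordering fix.
struct Checkpoint {
    entries: Vec<String>,
    is_ai: bool,
}

struct WorkingLog;

impl WorkingLog {
    fn read_all_checkpoints(&self) -> std::io::Result<Vec<Checkpoint>> {
        Ok(Vec::new()) // real version streams checkpoints.jsonl
    }
    fn reset_working_log(&self) -> std::io::Result<()> {
        Ok(()) // real version deletes checkpoints, blobs, and the INITIAL file
    }
    fn has_initial_attributions(&self) -> std::io::Result<bool> {
        Ok(false) // real version checks the INITIAL file
    }
}

fn run(
    working_log: &WorkingLog,
    reset: bool,
    is_pre_commit: bool,
    inter_commit_move: bool,
) -> std::io::Result<(usize, usize, usize)> {
    // 1. Read once, before anything destructive happens.
    let existing = working_log.read_all_checkpoints()?;

    // 2. The early-exit check sees the data that was actually on disk, so a
    //    reset can no longer make has_no_ai_edits vacuously true.
    if is_pre_commit && !inter_commit_move {
        let has_no_ai_edits = existing
            .iter()
            .all(|cp| cp.entries.is_empty() || !cp.is_ai);
        if has_no_ai_edits && !working_log.has_initial_attributions()? {
            return Ok((0, 0, 0));
        }
    }

    // 3. Only now apply the reset when requested.
    let checkpoints = if reset {
        working_log.reset_working_log()?;
        Vec::new()
    } else {
        existing
    };

    // ... the rest of run() would build and append the fresh checkpoint.
    let _ = checkpoints;
    Ok((0, 0, 0))
}
```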